Introduction

Recent advances in omics technologies have revolutionized biological research providing vast amounts of heterogeneous data. The drop in ‘omics’ technologies costs enables collection of data from multiple sources; facilitating the study of gene expression, proteins, metagenomics and metabolites, and to evaluate their relationship with health and diseases status. However, the integration of these heterogeneous datasets is not a trivial task. Several statistical methods have been developed to handle these challenges (Mariette and Villa-Vialaneix 2017,Rohart et al. (2017),Argelaguet et al. (2018)). STATIS-ACT (Plantes 1976, Y. Escoufier, Cazes, and others (1976)) is one of the methods capable of analyzing data arising from several configurations and to compare subspaces.

STATIS is a PCA extension to analyze multiple datasets of variables collected from the same individual or the same variables collected from multiple individuals (Lavit et al. 1994). The method was designed to integrate several datasets into an optimum space called compromise. However, STATIS has some constraint since can only be used with continuous data. In addition, the subspace generated by STATIS only can establish relationships between common elements in all datasets, i. e. observations (samples) or variables (genes/transcripts/proteins/metabolites/microbial communities), leading to additional drawbacks. Neither is it possible to establish relationships between observations and variables, nor can it provide a variable selection strategy.

Here, we present Link-HD, an R package to integrate multiple heterogeneous datasets based on STATIS. Our approach is a generalization of multidimensional scaling, which uses distance matrices for numerical, categorical or compositional variables instead of the raw data. Furthermore, the methodology was enhanced with Regression Biplot (Gabriel 1971,Gower and Hand (1995)) and differential abundance testing to perform variable selection and to identify OTUs that differ between two or more categories.

Methods

Our package was designed to integrate multiple communities from a holistic view. It can deal with compositional data, and it includes centered log ratio and cumulative-sum scaling (CSS) (Paulson et al. 2013) transformations. Link-HD also incorporates variable selection and visualization tools which allows for a better undestunding of communities structure, and, their relationship with target traits.

Link-HD pipeline overview. First, raw data are transformed using cumulative sum scaling (CSS) (Paulson et al. 2013) or centered log ratio transformation (clr) (Aitchison 1982). Second, distance and cross -product matrices are obtained to assess the similarity between datasets. Third, the compromise uses the first eigenvectors from RV to build the common configuration W matrix. The intra-structure step is the eigen-decomposition of W, where each observation is projected onto the common configuration. In this step, clusters of observations can be obtained. Finally, relationships between clusters and targets can identify selected variables responsible for the attained structure. (B) Link-HD includes visualization tools to explore relationships between data. In this example from the rumen microbiota it is shown: cross-product matrices relationships (upper left panel), ‘ruminotypes’ from the consensus structure (upper right panel) and the abundance between variables responsible of the obtained structure (lower panel).

STATIS: a general overview

STATIS-ACT (Plantes 1976, Y. Escoufier, Cazes, and others (1976)) aims is to compare and analyze the relationships between data that share a common collection of observations, but each may have different variables. The data consists of \(k\) matrices \(X_{[k]}\) of \(n \times j_{[k]}\) , where \(n\) is the number of observations and \(j_{[k]}\) the number of variables measured in \(k-th\) matrix. The idea behind STATIS is to obtain a compromise matrix, which is a linear combination of the matrices from the initial tables. The analysis of this matrix gives a two/three -dimensional representation (principal axes maps) that can be used to interpret its structure. The methodology is implemented in three main phases [2]:

  • Inter-structure: It measures the similarity between the \(k\) configurations of the matrices. This phase is equivalent to defining a distance among the corresponding scalar product matrices.

  • Compromise: It generates an optimal weight coefficient for each \(k\) matrix, in order to construct a common structure for all configurations.

  • Intra-structure: It is a generalized PCA analysis of the structure obtained in the previous stage.

STATIS overview. This figure shows the three steps of the STATIS: Inter-structure, Compromise and Intra-structure. In the inter-structure, the similarity dataset structure is analyzed and each \(X_{[k]}\) matrix is then projected onto the first and second components of the \(Q\) matrix. The compromise uses the first eigenvectors from \(Q\) to build the common configuration \(W\) matrix. Finally, the intra-structure step is the eigen-decomposition of \(W\) and each observation is projected onto the first and second components of the \(A\) matrix.

The mathematics behind the methodology can be described as follows. The first task requires computations of \(k\) configurations of symmetrical matrices \(W_{[k]}\) for each \(X_{[k]}\) matrix of dimension \(n \times j_{[k]}\) according to equation ((1)):

\[\begin{equation} \tag{1} W_{[k]}=X_{[k]}M_{[k]}X^\top_{[k]} \end{equation}\]

where \(W_{[k]}\) is a square matrix of size \(n\); \(M_{[k]}\) is a diagonal matrix including the variable weights, which is generally the same matrix for all the \(k\) matrices \(M=M_{[k]}=\frac{1}{j_{[k]}}I_{j_{[k]}}\) where \(I_{j_{[k]}}\) is the identity matrix of size \(j_{[k]}\); and \(X^\top_{[k]}\) denotes the transpose of \(X_{[k]}\) matrix. Then, the similarity between the two configurations \(k\) and \(k^\prime\) can be established computing the Hilbert-Smith product of equation (\(<.>_{HS}\), (Robert and Escoufier 1976)) presented in equation (2) or in its normalized version \(\rho_{k,k^\prime}\).

\[\begin{eqnarray} \tag{2} \langle W_{[k]} \arrowvert W_{[k^\prime]}\rangle_{HS} = \left\{ \begin{array}{cl} tr(DW_{[k]}DW_{[k^\prime]}) & if~k \neq k^\prime\\ ||W_{[k]}||^2 & if~k = k^\prime \end{array} \right.\\ \tag{3} \rho_{k,k^\prime}=\frac{\langle W_{[k]} \arrowvert W_{[k^\prime]} \rangle_{HS}}{\sqrt{\langle W_{[k]} \arrowvert W_{[k]} \rangle_{HS}\langle W_{[k^\prime]} \arrowvert W_{[k^\prime]} \rangle_{HS}}} \end{eqnarray}\]

\(tr(.)\) is the trace of the argument matrix; \(D\) is the square weighted observation matrix of size \(n\), which in general is \(D=\frac{1}{n}I_n\); and \(||.||^2\) is the classic Euclidean norm-2 of the argument matrix. In this context, equation (2) can be geometrically interpreted as the scalar product between two positive semi-definite matrices, which is proportional to the cosine of the angle between them.

The inter-structure’s objective is to decide whether or not there is a common structure to all representative \(W_{[k]}\)’s objects. The scalar products \(\rho_{k,k^\prime}\) can be organized in a \(k \times k\) positive semi-definite matrix \(R\) and their spectral decomposition obtains the similarity analysis among the configurations as a generalized correlation coefficient.

However, the formulation from the (1) has some restrictions, i.e., it only can be computed when all observations are continuous. This constraint can be averted by replacing the scalar product (1) between observations by a generalized distance cross-product matrix. In Link-HD, we implemented euclidean, canberra, pearson, pearsonabs, spearman, spearmanabs, mahalanobis, jaccard, simple matching and Aitchison distances. Then, each \(W_k\) is obtained as follows:

\[\begin{eqnarray} \tag{4} W_{k}=\frac{1}{2} H_k D^2 H'_k \\ \tag{5} H=I-\frac{1}{n} 11' \end{eqnarray}\]

whre \(D^2\) in (4) is a distance matrix computed from the raw data (\(X\)) and \(W_k\) is the cross-product transformation of \(D\).

\(W\), the matrix that summarize all of \(W_{k}\) can be obtained by a weighted mean of the \(W_{[k]}\) configurations (Lavit 1988):

\[\begin{eqnarray} \tag{6} W=\sum_{k=1}^{K}\alpha_{[k]}W_{[k]}\\ \tag{7} \alpha=(1^{T}q_{1})^{-1} \times q_{1}Dataset<-Normalization names(Dataset)<-c("16_S","Archaea","18_S") \end{eqnarray}\]

where, \(\alpha\), is the vector with optimal weights, obtained by re-scaling these values such that their sum is equal to one; \(q_1\), is the first eigenvector of the \(R\) matrix and \(\alpha_{[k]}\), is the weight coefficient for the \(k\) configuration.

The compromise is a cross-product matrix and it can be eigen-decomposed. Thus, it is possible to obtain a graphical representation from their Principal Components Analysis (PCA) where the scores are the best common representation for the set of \(K\) matrices. Moreover, scores can be used to perform a cluster analysis in the common subspace.

Selecting variables

Regression Biplot approach

Classical STATIS is neither designed to establish relationships between observations and variables, nor to perform variable selection. However, the general regression biplot formulation from Gower and Hand (1995) averts these limitations.

\[\begin{equation} X\approx AB^\top+\varepsilon \tag{8} \end{equation}\]

The equation (8) can be understood as a multivariate regression of \(X\) on the coordinates of rows \(A\), when they are fixed, or a multivariate regression of \(X^T\) on the coordinates of the variables \(B\), if they are fixed. Then, we could easily define a regression biplot through a classical linear model, i. e., \(E(X)=AB^\top\) if the response has a normal distribution, or using a generalized linear model \(E(X)=g(\mu)=AB^\top\) when the response variable has an exponential family distribution and the link function used is \(g(.)\) (Greenacre 2010). Here, we used the ideas from (8) to project all variables into the common subspace obtained by the interstructure step. Variables are selected using the accuracy of the regression biplot, which is measured using the proportion of explained variance by each regression (adjusted r squared from the model).

Differential abundance approach

Link-HD also implements a differential abundance analysis which is computed using a rank-based nonparametric Kruskal-Wallis test (Kruskal and Wallis 1952). Variables are selected after applying a procedure that controls the false discovery rate (FDR) (Benjamini and Hochberg 1995).

Usage

Software Installation

You need devtools to install LinkHD

#if you don't have devtools, please install it:
install.packages("devtools")
#if devtools is already installed, ommit previous step
library(devtools)
devtools::install_github(repo ="lauzingaretti/LinkHD" ,dependencies = TRUE)
library(LinkHD)

Congrats! The software is ready to use! You can check how to use it using this manual as well as the help of each function, for instance:

 ?? LinKData()

is the help for the LinkHD main function

Software generalities

The software requires a list of data frames as input including common elements in rows. Read_Data facilitates data reading. This function has only one argument, which is the path to the parent folder which contains all the data to be used in the analysis. Please check the function help for more information. Otherwise, you can read all your dataset as usual (using read.table, read.csv or any read function) and store them into a list. Warning! Please remember to name the list using input data names. Please make sure all data has the same row_names, the main LinkHD function checks row names and it stops if they don’t match.

Examples

To illustrate the usefulness of Link-HD, we used public data from the TARA Ocean expedition (https://oceans.taraexpeditions.org/en/m/about-tara/les-expeditions/tara-oceans/) as well as data from the ruminal metataxonomic communities (including bacteria, archaea and protozoa data).

Results

Applied to TARA Ocean, we have integrated environmental, structural and funcional information from prokaryotic communities. In concordance with the original publication and results obtained with mixKernel (Mariette and Villa-Vialaneix 2017). Link-HD reproduces the relevant role of temperature of the Proteobacteria phyla on the structure of this ecosystem. The next lines show how to perform Tara Ocean integration:

library(LinkHD)
data("Taraoceans")
data(Taraoceans)
pro.phylo <- Taraoceans$taxonomy[ ,"Phylum"]
TaraOc<-list(Taraoceans$phychem,as.data.frame(Taraoceans$pro.phylo),as.data.frame(Taraoceans$pro.NOGs))
TaraOc_1<-scale(TaraOc[[1]])
Normalization<-lapply(list(TaraOc[[2]],TaraOc[[3]]),function(x){DataProcessing(x,Method="Compositional")})
colnames(Normalization[[1]])=pro.phylo
colnames(Normalization[[2]])=Taraoceans$GO
TaraOc<-list(TaraOc_1,Normalization[[1]],Normalization[[2]])
names(TaraOc)<-c("phychem","pro_phylo","pro_NOGs")
TaraOc<-lapply(TaraOc,as.data.frame)
TaraOc<-lapply(TaraOc,as.data.frame)
Tara_Out<-LinkData(TaraOc,Scale =FALSE,Center=FALSE,Distance = c("ScalarProduct","Euclidean","Euclidean"))

The compromise plot from Tara Oceans can be obtained by the next function:

CompromisePlot(Tara_Out)+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
                                                               panel.background = element_blank(), axis.line = element_line(colour = "black"))

The correlation plot between tables for the Tara Oceans data can be easily obtained by:

CorrelationPlot(Tara_Out) +
 theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"))

As expected, the correlation between pro_NOGs and pro_phylo is higher than with phychemestry variables.

We select variables using Biplot methodology

Selection<-VarSelection(Tara_Out,TaraOc,Crit="Rsquare",perc=0.95)

The following plot only gives the frequency of selected variables from all the tables.

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
## The following objects are masked from 'package:reshape':
## 
##     colsplit, melt, recast
Ds<-as.data.frame(table(Selection@Variables))
ggplot(Ds, aes(x=Var1, y = Freq)) +
  geom_bar(stat="identity", fill="steelblue") +
  ggpubr::rotate_x_text(45)

You can find additional analysis and examples about Tara Ocean in the package manual. Now, we will move to the Ruminotypes example.

Ruminotypes analysis

library(LinkHD)
data("Ruminotypes")

In the second example, which is developed in this vignette, metataxonomic data of ruminal communities (including bacteria, archaea and protozoa data) from 65 Holstein cows were analyzed to evaluate the relationship between communities and their role in methane emmision yield (\(CH_{4}y\), g \(CH_{4}\)/kg feed intake).

library("LinkHD")
library("ggplot2")
options(width =50)
data("Ruminotypes")

Data Processing

The DataProcessing function of Link-HD, can be implemented in two differents ways. We recommend use of the ‘Standard’ option if your data is continuous and normally distributed. The ‘Compositional’ option is recommended for microbiome data, due to their compositional nature. Applied to a compositional-like dataset this option implements a centered log-ratios transformation (Aitchison 1982). Finally, if any external function have been used or the dataset is already normalized the user can skip this step and directly run the LinkData function (see below).

Normalization<-lapply(list(Ruminotypes$`16_S`,Ruminotypes$Archaea,Ruminotypes$`18_S`),function(x){DataProcessing(x,Method="Compositional")})
Dataset<-Normalization
names(Dataset)<-c("16_S","Archaea","18_S")

Link-HD also allows as input MultiAssayExperiment data, a Bioconductor constructor that combines multiple data elements from different experiment.

library(LinkHD)
data("Ruminotypes")
Normalization<-lapply(list(Ruminotypes$`16_S`,Ruminotypes$Archaea,Ruminotypes$`18_S`),function(x){DataProcessing(x,Method="Compositional")})
Dataset<-Normalization
names(Dataset)<-c("16_S","Archaea","18_S")
Dataset<-MultiAssayExperiment(experiments = Dataset)

LinkData is the main function of Link-HD, which performs the integration step (i.e. the STATIS methodology incorporating distances options). After data preprocessing (centered log ratio, clr transformation in our example), we can implement a classical Euclidean distance. Since a clr transformation has been done, this step is equivalent to applying the Aitchison distance, which is a more appropriated and interpretable metric for compositional data analysis (Aitchison 1982). Note that our method also allows the implementation of other popular distance metrics in microbial ecology studies such as Bray-Curtis distance. Once the consensus space is defined, we can use the package to explore the relationship with external phenotype information and, therefore identify external sources of variation that may explain the compromise configuration. In our example, and based on preliminary results, samples were stratified according to the described Ruminotype structure (Kittelmann et al. 2013, Danielsson et al. (2017), Ramayo-Caldas et al. (2019)). The algorithm employed in this step was the partitioning around medioids (pam), which is a more robust version of Kmeans (Park and Jun 2009).

Getting common structure

Here, we run the main function of LinkHD, i.e., LinkData, which returns the compromise structure. Since we have applied the clr transformation in the previous step, setting “Euclidean” as distance method is equivalent to computing the Atchison distance for compositional data. The Scale and Center arguments are only useful when your dataset was not previously normalized and they are not useful in a communities analysis context. Finally, the nCluster argument is used to obtain a group structure from the compromise configuration.

Output<-LinkData(Dataset,Distance=rep("euclidean",3),Scale = FALSE,Center=FALSE,nCluster = 3)
#variance explained by compromise coordinates
CompromisePlot(Output)+theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour="black"))

The color indicates the quality of representation of each individual in the two dimensional space from the compromise PCA (more light blue means that the observation is better represented in a low dimentional space).

Optionally, data could be a list of S4 ExpressionSet objects. The advantage of ExpressionSet is their capability to describe samples and features in the experiment, as well as information related to the whole experiment and data-processing. To use ExpressionSet class, please follow the next lines:

B<-ExpressionSet()
Dataset1<-Normalization
A<-lapply(Dataset1,as.matrix)
B<-lapply(A,ExpressionSet)
Output_<-LinkData(B,Distance=rep("euclidean",3),Scale = FALSE,Center=FALSE,nCluster = 3)

Obtaining individual projections

The sample distribution within each input dataset (equivalent to coordinates principal analysis) can be explored using the Globalplot option. This plot is the projection of each individual table into the compromise space.

GlobalPlot(Output)

Relationships between communities

We use the Generalized Multivariate Correlation Coefficient RV to estimate the relationship between each microbial community. The results in the next figure suggests a strong relationship between the prokaryotic communities of Archaea and Bacteria (16_S).

CorrelationPlot(Output)

Sample stratification in ruminotypes-like clusters

In agreement with the original publication that found a ruminotype-like structure for rumen bacteria (Ramayo-Caldas et al. 2019), Link-HD recovers a Ruminotypes-like cluster structure (including archaea, bateria and protozoa) of the rumen ecosystem (next figure). Note that our function is flexible and the algorithm can be run for different configurations or set numbers of cluster. Although, we have not implemented an algorithm to obtain the optimum number of groups, it could be easily done using extant R functions, like e.g., the Silhouette coefficient (Rousseeuw 1987).

mydata=Output@Compromise.Coords
#set colors
cols <- c("1" = "red", "2" = "blue", "3" = "orange")
mydata$fit.cluster<-as.factor(mydata$fit.cluster)
p<-ggplot(mydata, aes(Dim.1,Dim.2)) +
geom_point(aes(colour = fit.cluster))
p + scale_colour_manual(values = cols) + theme_classic() 

The obtained clusters of samples can be used to evaluate their association with external variables (i.e., host phenotype information). Previous studies (Kittelmann et al. 2013,Danielsson et al. (2017),Ramayo-Caldas et al. (2019)) reported a link between the structure of the rumen microbiome and \(CH_4\) emission. Therefore, in our study, we contrast the \(CH_4y\) emission levels between the three clusters of samples. In agreement with the afore mentioned studies, cluster assignation was significantly associated with \(CH_4\) emission. Significant differences between two out of three groups were found, cows clustered within C1 and C2 having lower and higher \(CH_4\) emission.

library(lsmeans)
Dat<-data.frame("y"=Ruminotypes$phenotype$ch4y,"E"=as.factor(mydata$fit.cluster))
s=lm(y~E,data=Dat)
#summary(s)
lsmeans(s, "E")
##  E lsmean    SE df lower.CL upper.CL
##  1   23.6 0.567 58     22.5     24.7
##  2   26.3 0.688 58     25.0     27.7
##  3   23.1 0.650 58     21.8     24.4
## 
## Confidence level used: 0.95
pairs(lsmeans(s, "E"))
##  contrast estimate    SE df t.ratio p.value
##  1 - 2      -2.734 0.891 58 -3.068  0.0091 
##  1 - 3       0.485 0.863 58  0.562  0.8405 
##  2 - 3       3.219 0.946 58  3.401  0.0034 
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Variable selection

Two alternative methods for variable selection are implemented. The first one is based on regression biplot. From a general point of view, a regression biplot can be understood as the decomposition of a target matrix. Therefore, it could be interpreted as a projection of each variable (i.e., OTUs) onto the compromise structure. To accomplish that goal, we used an external linear model to obtain the regressors coefficients (by projecting all OTUs on the n first compromise axis). Finally, we measure the representation quality using the proportion of explained variance by each regression (\(R^2\)) to select a proportion (user defined) of variables with a higher R-squared. The second approach is based on differential abundance analysis, which is computed through a non-parametric Kruskall Wallis test. In both scenarios the comparison is made using the clr-transformed data with the goal of identifying feautures related with the compromise configuration. Finally, false discovery rate (FDR) is applied to select only variables with an FDR < 0.05 (this level can also be defined by the user). The biplot procedure can be implemented using the Select_Var function as folow. The Crit argument refers to the criteria used to select the variables and the perc is to establish the percentage of variables to be selected. Intercept is set to F

Select_Var<-VarSelection(Output,Data=Dataset,Crit = "Rsquare",perc=0.9)

Meanwhile, the differential abundance procedure is implemented in the dAB function:

M<-dAB(Output,Data=Dataset)

Taxonomic aggregation and Taxon Set Enrichment Analysis

To facilitate the biological interpretation of the results (after variable selection) the taxonomic classification of the OTUs can be used to aggregate the OTUs at the user defined taxonomic level. We developed a function to establish whether a group of selected OTUs is ‘enriched’ for a particular taxonomic level (i.e., family or genus) by using the cumulative hypergeometric distribution. This function is called OTU2Taxa and can be used in the list of OTUs selected by the two approaches (biplot or differential abundance)

The next lines split selected variables from Biplot methodology by each table, i.e., which selected variables come from of Archaea, Bacteria and protozoa.

#Select only the selected variables form Select_Var object
SelectedB_16S<-Select_Var@Variables[Select_Var@VarTable=="16_S"]
SelectedB_Archaea<-Select_Var@Variables[Select_Var@VarTable=="Archaea"]
SelectedB_18S<-Select_Var@Variables[Select_Var@VarTable=="18_S"]

The same as before, but for selected variables from the dAB function:

Selec_16S<-as.matrix(Ruminotypes[[1]][,colnames(Dataset[[1]])%in%names(M[[1]])])
Selec_Archaea<-as.matrix(Ruminotypes[[2]][,colnames(Dataset[[2]])%in%names(M[[2]])])
Selec_18S<-as.matrix(Ruminotypes[[3]][,colnames(Dataset[[3]])%in%names(M[[3]])])

The next figure represents the bacterial profile of the most important variables according to the Biplot procedure aggregated at family level.

library(reshape2)
SignTaxa<-OTU2Taxa(Selection=Select_Var@VarTable,TaxonInfo=Ruminotypes$Taxa_16S,tableName="16_S",AnalysisLev = "Family")
melted_Table <- melt(SignTaxa$TotalUp1)
ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#63D2E4",na.value="steelblue") +
  geom_bar(stat="identity", fill="#63D2E4")+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(1.5)))

The following figure shows the Bacteria family taxonomy of most important variables according to the differential abundance analysis.

SignTaxa<-OTU2Taxa(Selection=M,TaxonInfo=Ruminotypes$Taxa_16S,tableName="16_S",AnalysisLev = "Family")
melted_Table <- melt(SignTaxa$TotalUp1)
ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#FF0066",na.value="steelblue") +
  theme_classic()+
  geom_bar(stat="identity", fill=  "#56B4E9" )+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90) +
  theme(axis.text.x=element_text(size=rel(0.9)))

After aggregating the selected variables we can also explore their relative abundance patterns (i.e., most relevant genera/family) and distribution between clusters. For instance, in our example, it is reasonable to expect the highest differences between samples classified as C1 and C3. The next examples are focused on members of Ruminococcaceae, Christensenellaceae and Succinivibrionaceae family, but the user can extend it to other family or taxonomic levels:

Rumin<-which(Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%SelectedB_16S,]$Family=="Ruminococcaceae")
OTUs<-Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%SelectedB_16S,][Rumin,]$OTUID
Rumin_Tot<-Dataset[['16_S']][,colnames(Dataset[['16_S']])%in%OTUs]
RuminSum<-apply(Rumin_Tot,1,sum)

Total<-data.frame(RuminSum,"Group"=as.factor(Output@Compromise.Coords$fit.cluster))
p <- ggplot(Total, aes(x=Group, y=RuminSum, fill=Group)) +
  geom_boxplot()+
  scale_fill_manual(values=c("Tomato", "Orange", "DodgerBlue"))+
  theme_classic()+ ggtitle(paste("Differences at level",rownames(melted_Table)[16]))
p

Christ<-which(Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%SelectedB_16S,]$Family==rownames(melted_Table)[14])
OTUs<-Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%SelectedB_16S,][Christ,]$OTUID
Christ_Tot<-Dataset[['16_S']][,colnames(Dataset[['16_S']])%in%OTUs]
ChristSum<-apply(Christ_Tot,1,sum)
Total<-data.frame(ChristSum,"Group"=as.factor(Output@Compromise.Coords$fit.cluster))
p <- ggplot(Total, aes(x=Group, y=ChristSum, fill=Group)) +
  geom_boxplot()+
  scale_fill_manual(values=c("Tomato", "Orange", "DodgerBlue"))+
  theme_classic()+ ggtitle(paste("Differences at level",rownames(melted_Table)[14]))
p

The following example represents an alternative view to the previous plot:

ggplot(Total, aes(x=Group, y=rownames(Total), fill=ChristSum)) +
  geom_tile() + scale_fill_gradient(low = "steelblue",
                                    high = "#FF0066",na.value="steelblue") +
  ggpubr::rotate_x_text(45)+
  theme_classic() +
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  theme(axis.text.y=element_text(size=rel(0.6)))

As illustrated in the previous figure, samples from the cluster 2 have a higher frequency of OTUs members of the family Christensenellaceae.

Succi<-which(Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%SelectedB_16S,]$Family=="Succinivibrionaceae")
OTUs<-Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%SelectedB_16S,][Succi,]$OTUID
Succi_Tot<-Dataset[['16_S']][,colnames(Dataset[['16_S']])%in%OTUs]
SucciSum<-apply(Succi_Tot,1,sum)

Total<-data.frame(SucciSum,"Group"=as.factor(Output@Compromise.Coords$fit.cluster))
Total<-Total[- which(Total[,1]>1),]
p <- ggplot(Total, aes(x=Group, y=SucciSum, fill=Group)) +
  geom_boxplot()+
  scale_fill_manual(values=c("#e46563", "#357fe3", "#75e463"))+
  theme_classic()+ ggtitle(paste("Differences at level","Succinivibrionaceae"))+
   theme(axis.text.x=element_text(size=rel(1.5)),legend.title=element_text(size=rel(1.5)), 
        legend.text=element_text(size=rel(1.5)))
p

In agreetment with previous results (Kittelmann et al. 2013, Danielsson et al. (2017), Ramayo-Caldas et al. (2019)), members of Succinivibrionaceae showed a higher abundance on the lower CH4 emitting cows. Please note the agreement between variables selected by the two approaches. For instance, the three most relevant families associated with Ch4y emission explains the compromise configuration (Ruminococcaceae, Anaerolineaceae and Succinivibrionaceae). Focusing on the protozoa dataset, the next figure represents the most important variables at the Genus level according to the regression biplot:

SignTaxa<-OTU2Taxa(Selection=Select_Var@VarTable,TaxonInfo=Ruminotypes$Taxa_18S,tableName="18_S")
melted_Table <- melt(SignTaxa$TotalUp1)
ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#B848DB",na.value="steelblue") +
  geom_bar(stat="identity", fill="#B848DB")+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(1.5)))

Meanwhile, as suggested by the differential abundance the most relevant genera at 18S rRNA gene level were:

SignTaxa<-OTU2Taxa(Selection=M,TaxonInfo=Ruminotypes$Taxa_18S,tableName="18_S")
melted_Table <- melt(SignTaxa$TotalUp1[SignTaxa$TotalUp1>1])
ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
scale_fill_gradient(low = "steelblue",
                      high = "#FF0066",na.value="steelblue") +
  geom_bar(stat="identity", fill=  "#56B4E9" )+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(0.8)))

In the next boxplot we represent the cluster distribution of the Diplodinim genus relative abundance.

Dip<-which(Ruminotypes$Taxa_18S[Ruminotypes$Taxa_18S$OTUID%in%SelectedB_18S,]$Genus=="Diplodinium")
OTUs<-Ruminotypes$Taxa_18S[Ruminotypes$Taxa_18S$OTUID%in%SelectedB_18S,][Dip,]$OTUID
Dip_Tot<-Dataset[['18_S']][,colnames(Dataset[['18_S']])%in%OTUs]
DipSum<-apply(Dip_Tot,1,sum)

Total<-data.frame(DipSum,"Group"=as.factor(Output@Compromise.Coords$fit.cluster))


p <- ggplot(Total, aes(x=Group, y=DipSum, fill=Group)) +
  geom_boxplot()+
  scale_fill_manual(values=c("Tomato", "Orange", "DodgerBlue"))+
  theme_classic()+ ggtitle(paste("Differences at level","Diplodinium"))
p

An alternative visualization of the previous result:

ggplot(Total, aes(x=Group, y=rownames(Total), fill=DipSum)) +
  geom_tile() + scale_fill_gradient(low = "steelblue",
                                    high = "#FF0066",na.value="steelblue") +
  ggpubr::rotate_x_text(45)+
  theme_classic() +
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  theme(axis.text.y=element_text(size=rel(0.6)))

We noted similar pattern to the one observed at 16S rRNA gene level. In both approaches the top relevant genera were the same: Diplodinium, Ophryoscolex, Tetratrichomonas, Tritrichomonas.

Finally, at the Archaea level we also obtained a high degree of agreenment between the two ways implemented for variable selection.

SignTaxa<-OTU2Taxa(Selection=Select_Var@VarTable,TaxonInfo=Ruminotypes$Taxa_Archaea,tableName="Archaea")
melted_Table <- melt(SignTaxa$TotalUp1)
ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#46D6AD",na.value="steelblue") +
  geom_bar(stat="identity", fill="#46D6AD")+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(1.5)))

SignTaxa<-OTU2Taxa(Selection=M,TaxonInfo=Ruminotypes$Taxa_Archaea,tableName="Archaea")
melted_Table <- melt(SignTaxa$TotalUp1)
ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#FF0066",na.value="steelblue") +
  geom_bar(stat="identity", fill=  "#56B4E9" )+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(0.9)))

Differences in the relative abundance of Methanobrevibacter between the clusters of samples are represented in the next figure. As expected, lower relative abundance of Methanobrevibacter was observed in the cluster of the lower emitting samples.

Met<-which(Ruminotypes$Taxa_Archaea[Ruminotypes$Taxa_Archaea$OTU%in%SelectedB_Archaea,]$Genus==levels(Ruminotypes$Taxa_Archaea$Genus)[4])
OTUs<-Ruminotypes$Taxa_Archaea[Ruminotypes$Taxa_Archaea$OTU%in%SelectedB_Archaea,][Met,]$OTU
Met_Tot<-Dataset[['Archaea']][,colnames(Dataset[['Archaea']])%in%OTUs]
MetSum<-apply(Met_Tot,1,sum)
Total<-data.frame(MetSum,"Group"=as.factor(Output@Compromise.Coords$fit.cluster))
p <- ggplot(Total, aes(x=Group, y=MetSum, fill=Group)) +
  geom_boxplot()+
  scale_fill_manual(values=c("Tomato", "Orange", "DodgerBlue"))+
  theme_classic()+ ggtitle(paste("Differences at level","Methanobrevibacter"))
p

ggplot(Total, aes(x=Group, y=rownames(Total), fill=MetSum)) +
  geom_tile() + scale_fill_gradient(low = "steelblue",
                                    high = "#FF0066",na.value="steelblue") +
  ggpubr::rotate_x_text(45)+
  theme_classic() +
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  theme(axis.text.y=element_text(size=rel(0.6)))

Taxon set enrichment analysis from a user-defined OTUs’ list

OTU2Taxa function can be used to aggregate OTUs into their taxonomic characteristics and to compute the most significant (“enrichment”) genera/families to any OTUs’ list, i.e., a list obtained by another methodology. For instance, in Ramayo-Caldas et al. (2019) the authors selected two OTUs list by applying SPLS and Discriminant SPLS (SPLS DA). Here, we show how to use these external dataset with OTU2Taxa:

library("readxl")
library("XLConnect")
library("openxlsx")

# Download the list from supplementary material: 
temp = tempfile(fileext = ".xls")
dataURL <- "https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1111%2Fjbg.12427&file=jbg12427-sup-0004-TableS4.xls"
download.file(dataURL, destfile=temp, mode='wb')
SPLS_DA<-read_excel(temp,sheet=1)
SPLS<-read_excel(temp,sheet=2)

#Create a list with selected OTU. Note that if you have only one table, the list should have lenght equal to 1. 
Selected_OTU<-list("SPLS"=SPLS$OTU,"SPLS_DA"=SPLS_DA$OTU)
# OTUs have to be the lists names:
names(Selected_OTU$SPLS)<-SPLS$OTU
names(Selected_OTU$SPLS_DA)<-SPLS_DA$OTU


#TaxonInfo is the same than 16_S taxon info into Ruminotypes dataset: 
#Apply OTU2Taxa
SignTaxa<-OTU2Taxa(Selection=Selected_OTU,TaxonInfo=Ruminotypes$Taxa_16S,tableName="SPLS",AnalysisLev = "Family")
SignTaxa2<-OTU2Taxa(Selection=Selected_OTU,TaxonInfo=Ruminotypes$Taxa_16S,tableName="SPLS_DA",AnalysisLev = "Family")



melted_Table <- melt(SignTaxa2$TotalUp1)
ggplot(data =melted_Table, aes(x= reorder(rownames(melted_Table), -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#FF0066",na.value="steelblue") +
  theme_classic()+
  geom_bar(stat="identity", fill=  "#ff661a" )+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90) +
  theme(axis.text.x=element_text(size=rel(0.9)))

Bonus track 1: a network based approach to better understand relationships between selected variables

The structure and interaction patterns between microbial communities can be effectively studied using network-based approaches (Faust and Raes 2012, Lima-Mendez et al. (2015), Berry and Widder (2014), Ramayo-Caldas et al. (2016)). We implemented a network approach to illustrate a downstream analysis that can be done to better understand the association between most relevant features within and across communities. Here, we used SPIEC-EASI (Kurtz et al. 2015) to infer the interaction networks of selected variables and igraph to represent the resulting interaction patterns. SPIEC-EASI is a computational tool specifically designed to deal with compositional data, which avoids the compositional effects bias of classical correlation methods (Aitchison, Kay, and others 2003, Pawlowsky-Glahn and Buccianti (2011) Kurtz et al. (2015)). In the first step, the analysis is done within each community and then using all communities in a single analysis.

library(SpiecEasi)
#use only selected OTUs from cunts tables 

Selec_16S<-as.matrix(Ruminotypes[[1]][,colnames(Dataset[[1]])%in%names(M[[1]])])
Selec_Archaea<-as.matrix(Ruminotypes[[2]][,colnames(Dataset[[2]])%in%names(M[[2]])])
Selec_18S<-as.matrix(Ruminotypes[[3]][,colnames(Dataset[[3]])%in%names(M[[3]])])
se.mb.16S <- spiec.easi(Selec_16S, method='mb', lambda.min.ratio=1e-2, nlambda=20, pulsar.params=list(rep.num=50))
se.mb.Archaea <- spiec.easi(Selec_Archaea, method='mb', lambda.min.ratio=1e-2, nlambda=20, pulsar.params=list(rep.num=50))
se.mb.18S <- spiec.easi(Selec_18S, method='mb', lambda.min.ratio=1e-2, nlambda=20, pulsar.params=list(rep.num=50))
se.mb.All <- spiec.easi(cbind(Selec_16S,Selec_Archaea,Selec_18S), method='mb', lambda.min.ratio=1e-2, nlambda=20, pulsar.params=list(rep.num=50))

Interaction patterns at the Bacterial level

library(igraph)
## Define arbitrary threshold for SparCC correlation matrix for the graph
## Create igraph objects
ig.mb <- adj2igraph(getRefit(se.mb.16S))
## set size of vertex proportional to clr-mean
vsize=degree(ig.mb, mode="all") + 5
am.coord <- layout.fruchterman.reingold(ig.mb)
plot(delete.vertices(simplify(ig.mb),degree(ig.mb)==0), layout=am.coord, vertex.size=vsize, vertex.label=NA, main="16S i interactions",edge.color=ifelse(as.matrix(se.mb.16S$select$est$beta[[20]])>0, "green","red"))

Interaction patterns at the Protozoa level

## Define arbitrary threshold for SparCC correlation matrix for the graph
## Create igraph objects
ig.mb <- adj2igraph(getRefit(se.mb.18S))
## set size of vertex proportional to clr-mean
vsize=degree(ig.mb, mode="total")+5
am.coord <- layout.fruchterman.reingold(ig.mb)
plot(delete.vertices(simplify(ig.mb),degree(ig.mb)==0), layout=am.coord, vertex.size=vsize, vertex.label=NA, main="18S interactions",edge.color=ifelse(as.matrix(se.mb.18S$select$est$beta[[20]])>0, "green","red"))

Interaction patterns at the Archaea level:

## Define arbitrary threshold for SparCC correlation matrix for the graph
## Create igraph objects
ig.mb <- adj2igraph(getRefit(se.mb.Archaea))
## set size of vertex proportional to clr-mean
vsize=degree(ig.mb, mode="all")+5
am.coord <- layout.fruchterman.reingold(ig.mb)
plot(delete.vertices(simplify(ig.mb),degree(ig.mb)==0), layout=am.coord, vertex.size=vsize, vertex.label=NA, main="Archaea interactions",edge.color=ifelse(as.matrix(se.mb.Archaea$select$est$beta[[20]])>0, "green","red"))

Across communities interaction patterns

Finally, In the next figure, we display the relationships between the selected OTUs across the three communities. Node color represents the community: Bacterial (red nodes), Archaeal (green) and protozoal (blue). The node size is a function of the number of interactions. Based on the topological properties of the network, our results suggest a high level of connectivity between the three communities. Furthermore, in agreement with their relevant role on methanogenesis, OTUs from the Archaea communities have a higher number of interactions.

## Create igraph objects
ig.mb <- adj2igraph(getRefit(se.mb.All))
## set size of vertex proportional to clr-mean
vsize=degree(ig.mb, mode="all") +5 
am.coord <- layout.fruchterman.reingold(ig.mb)
Colors_by_community<-c(rep("#a8f7a6",ncol(Selec_16S)),rep("#2e2ee1",ncol(Selec_18S)),rep("#e5728c",ncol(Selec_Archaea)))
V(ig.mb)$color<-Colors_by_community
plot(delete.vertices(simplify(ig.mb),degree(ig.mb)==0), layout=am.coord, vertex.size=vsize, vertex.label=NA, main="All interactions",edge.color=ifelse(as.matrix(se.mb.All$select$est$beta[[20]])>0,"green","red"))

Bonus track 2: comparing results with mixKernel

Finally, we compared our results with mixKernel output from Statis and full-UMKL kernels.

library("LinkHD")
library("ggplot2")
library("mixKernel")
library("mixOmics")
options(width =50)
Dataset<-Normalization
names(Dataset)<-c("16_S","Archaea","18_S")

M16.kernel = compute.kernel(Dataset$'16_S')
Arch.kernel = compute.kernel(Dataset$'Archaea')
M18.kernel = compute.kernel(Dataset$'18_S')


#FULL-Statis
meta.kernel = combine.kernels(M16 = M16.kernel, M18 = M18.kernel, Arch = Arch.kernel, method = "STATIS-UMKL")
kernel.pca.result = kernel.pca(meta.kernel, ncomp = 10)

The correlations between compromise configuration from mixkernel and Link-HD were obtained using procrustes methodology from the vegan library.

mixkCompromise <- as.matrix(meta.kernel$"kernel")
linkhdCompromise <- as.matrix(Output@Compromise.Matrix)

protest(mixkCompromise, linkhdCompromise)
## 
## Call:
## protest(X = mixkCompromise, Y = linkhdCompromise) 
## 
## Procrustes Sum of Squares (m12 squared):        0.1331 
## Correlation in a symmetric Procrustes rotation: 0.9311 
## Significance:  0.001 
## 
## Permutation: free
## Number of permutations: 999

Selected variables from mixKernel at family level for 16-S are:

set.seed(17051753)
#Permuting Bacteria
kernel.pca.resultM16 = kernel.pca.permute(kernel.pca.result, ncomp = 1, M16 = colnames(Dataset$'16_S'))
M16relevance <- data.frame(kernel.pca.resultM16$cc.variables, kernel.pca.resultM16$cc.distances)
colnames(M16relevance) <- c("OTU", "Relevance")
M16relevance <- M16relevance[order(-M16relevance[,2]),] 
Rel<-M16relevance[1:120,]
melted_Table <- melt(table(Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%Rel$OTU,]$Family))
ggplot(data =melted_Table, aes(x= reorder(Var.1, -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#FF0066",na.value="steelblue") +
  geom_bar(stat="identity", fill="#FF0066")+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(0.9)))

Selected variables from mixKernel at genus level for 18-S are:

set.seed(17051753)
#Permuting 18S
kernel.pca.resultM18 = kernel.pca.permute(kernel.pca.result, ncomp = 1, M18 = colnames(Dataset$'18_S'))
M18relevance <- data.frame(kernel.pca.resultM18$cc.variables, kernel.pca.resultM18$cc.distances)
colnames(M18relevance) <- c("OTU", "Relevance")
M18relevance <- M18relevance[order(-M18relevance[,2]),] 
Rel<-M18relevance[1:11,]
melted_Table <- melt(table(Ruminotypes$Taxa_18S[Ruminotypes$Taxa_18S$OTUID%in%Rel$OTU,]$Genus))
melted_Table <-melted_Table[melted_Table$value>0,]

ggplot(data =melted_Table, aes(x= reorder(Var.1, -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#FF0066",na.value="steelblue") +
  geom_bar(stat="identity", fill="#FF0066")+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(0.9)))

Selected variables from mixKernel at genus level for Archaea are:

set.seed(17051753)
kernel.pca.resultArch = kernel.pca.permute(kernel.pca.result, ncomp = 1, Arch = colnames(Dataset$'Archaea'))
Archrelevance <- data.frame(kernel.pca.resultArch$cc.variables, kernel.pca.resultArch$cc.distances)
colnames(Archrelevance ) <- c("OTU", "Relevance")
Archrelevance  <- Archrelevance [order(-Archrelevance [,2]),] 

Rel<-Archrelevance[1:45,]
melted_Table <- melt(table(Ruminotypes$Taxa_Archaea[which(Ruminotypes$Taxa_Archaea$OTU%in%Rel$OTU),]$Genus))
melted_Table <-melted_Table[melted_Table$value>0,]
#remove one row because it is not classify at genus level
melted_Table<-melted_Table[-1,]
ggplot(data =melted_Table, aes(x= reorder(Var.1, -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#FF0066",na.value="steelblue") +
  geom_bar(stat="identity", fill="#FF0066")+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(0.9)))

We also compared our results with mixKernel output from Statis and full-UMKL kernels.

options(width =50)
M16.kernel = compute.kernel(Dataset$'16_S')
Arch.kernel = compute.kernel(Dataset$'Archaea')
M18.kernel = compute.kernel(Dataset$'18_S')


#FULL-Statis
meta.kernel = combine.kernels(M16 = M16.kernel, M18 = M18.kernel, Arch = Arch.kernel, method = "full-UMKL")
kernel.pca.result = kernel.pca(meta.kernel, ncomp = 10)

The correlations between compromise configuration from mixkernel and Link-HD were obtained using procrustes methodology from the vegan library.

mixkCompromise <- as.matrix(meta.kernel$"kernel")
linkhdCompromise <- as.matrix(Output@Compromise.Matrix)

protest(mixkCompromise, linkhdCompromise)
## 
## Call:
## protest(X = mixkCompromise, Y = linkhdCompromise) 
## 
## Procrustes Sum of Squares (m12 squared):        0.1515 
## Correlation in a symmetric Procrustes rotation: 0.9211 
## Significance:  0.001 
## 
## Permutation: free
## Number of permutations: 999

Selected variables from mixKernel at family level for 16-S are:

set.seed(17051753)
#Permuting Bacteria
kernel.pca.resultM16 = kernel.pca.permute(kernel.pca.result, ncomp = 1, M16 = colnames(Dataset$'16_S'))
M16relevance <- data.frame(kernel.pca.resultM16$cc.variables, kernel.pca.resultM16$cc.distances)
colnames(M16relevance) <- c("OTU", "Relevance")
M16relevance <- M16relevance[order(-M16relevance[,2]),] 
Rel<-M16relevance[1:120,]
melted_Table <- melt(table(Ruminotypes$Taxa_16S[Ruminotypes$Taxa_16S$OTUID%in%Rel$OTU,]$Family))
ggplot(data =melted_Table, aes(x= reorder(Var.1, -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#FF0066",na.value="steelblue") +
  geom_bar(stat="identity", fill="#FF0066")+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(0.9)))

Selected variables from mixKernel at genus level for 18-S are:

set.seed(17051753)
#Permuting 18S
kernel.pca.resultM18 = kernel.pca.permute(kernel.pca.result, ncomp = 1, M18 = colnames(Dataset$'18_S'))
M18relevance <- data.frame(kernel.pca.resultM18$cc.variables, kernel.pca.resultM18$cc.distances)
colnames(M18relevance) <- c("OTU", "Relevance")
M18relevance <- M18relevance[order(-M18relevance[,2]),] 
Rel<-M18relevance[1:11,]
melted_Table <- melt(table(Ruminotypes$Taxa_18S[Ruminotypes$Taxa_18S$OTUID%in%Rel$OTU,]$Genus))
melted_Table <-melted_Table[melted_Table$value>0,]

ggplot(data =melted_Table, aes(x= reorder(Var.1, -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#FF0066",na.value="steelblue") +
  geom_bar(stat="identity", fill="#FF0066")+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(0.9)))

Selected variables from mixKernel at genus level for Archaea are:

set.seed(17051753)
kernel.pca.resultArch = kernel.pca.permute(kernel.pca.result, ncomp = 1, Arch = colnames(Dataset$'Archaea'))
Archrelevance <- data.frame(kernel.pca.resultArch$cc.variables, kernel.pca.resultArch$cc.distances)
colnames(Archrelevance ) <- c("OTU", "Relevance")
Archrelevance  <- Archrelevance [order(-Archrelevance [,2]),] 

Rel<-Archrelevance[1:45,]
melted_Table <- melt(table(Ruminotypes$Taxa_Archaea[which(Ruminotypes$Taxa_Archaea$OTU%in%Rel$OTU),]$Genus))
melted_Table <-melted_Table[melted_Table$value>0,]
#remove one row because it is not classify at genus level
melted_Table<-melted_Table[-1,]
ggplot(data =melted_Table, aes(x= reorder(Var.1, -value), y=value)) +
  scale_fill_gradient(low = "steelblue",
                      high = "#FF0066",na.value="steelblue") +
  geom_bar(stat="identity", fill="#FF0066")+
  theme_classic()+
  theme( axis.ticks = element_blank(),axis.title.x=element_blank(),axis.title.y = element_blank())+
  ggpubr::rotate_x_text(90)+
  theme(axis.text.x=element_text(size=rel(0.9)))

Despite the agreement between selected variables from mixKernel and Link-HD, their importance is not the same. The discrepancy can be explained by the way in which each package computes the variable weights. Whereas mixkernel only counts the number of OTUs into a genus/family, Link-HD implements a hypergeometric test to establish whether a group of selected OTUs is ‘enriched’ for a particular taxonomic level (i.e., family or genus).

Alternative omics layers: NCI-60 transcriptome datasets

The following example consists in the illustrative analysis of a subset of NCI-60 cell line panel (Shankavaram et al. 2009,Reinhold et al. (2012)), which is widely used in anti-cancer drug screening. The whole dataset contains 60 human tumour cell lines derived from patients with Leukemia, Melanoma, Lung, Colon, Central Nervous System, Ovarian, Renal, Breast and Prostate cancers. The gene expression of these cell lines was measured using four different microarrays platforms: Agilent mRNA (Whole Human Genome Microarray, \(4 \times 44K\)) and Affymetrix (HGU 95, HGU 133 and HGU 133plus 2.0). The toy dataset we use here is freely available in the omicade4 package (Meng et al. 2013).

library(omicade4)
data(NCI60_4arrays)
#check data dimensions
lapply(NCI60_4arrays, dim)
## $agilent
## [1] 300  60
## 
## $hgu133
## [1] 298  60
## 
## $hgu133p2
## [1] 268  60
## 
## $hgu95
## [1] 288  60

As Link-HD requires common elements in rows, matrices need to be transpose:

data<-lapply(NCI60_4arrays, function(x){t(x)})
lapply(data,dim)
## $agilent
## [1]  60 300
## 
## $hgu133
## [1]  60 298
## 
## $hgu133p2
## [1]  60 268
## 
## $hgu95
## [1]  60 288
Output<-LinkData(data,Distance=rep("euclidean",3),Scale = TRUE,Center=TRUE,nCluster = 9)
#variance explained by compromise coordinates
CompromisePlot(Output)+theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour="black"))

Note that Link-HD proportion of explained variance by the two first components (PC1=17.11% and PC=12.73%) is simmilar to the one reported by (Meng et al. 2013).

In order to better undestand the samples structure, we can color them by tumoral tissues:

cancer_type <- rownames(data$agilent)
#To select strings before the "."
cancer_type <- sapply(strsplit(cancer_type, split="\\."),function(x){x[1]})
mydata=Output@Compromise.Coords
#set colors
cols <- c("#ffcc00","#ff0066","#00cc99","#0000ff","#ffff00","#ffb3b3","#1affa3","#660033","#26004d")
mydata$fit.cluster<-as.factor(mydata$fit.cluster)
p<-ggplot(mydata, aes(Dim.1,Dim.2)) +
geom_point(aes(colour = cancer_type))
p + scale_colour_manual(values = cols) + theme_classic() 

Meanwhile, the next figure represents the color according to the Link-HD clusters assignation:

mydata=Output@Compromise.Coords
#set colors
cols <- c("#ffcc00","#ff0066","#00cc99","#0000ff","#ffff00","#ffb3b3","#1affa3","#660033","#26004d")
mydata=Output@Compromise.Coords
mydata$fit.cluster<-as.factor(mydata$fit.cluster)
p<-ggplot(mydata, aes(Dim.1,Dim.2)) +
geom_point(aes(colour = fit.cluster))
p + scale_colour_manual(values = cols) + theme_classic() 

Link-HD can obtaining clusters from the integrated datasets. We have implemented our method picking out nine (the number of tissues in NCI-60 dataset) and three clusters, respectively. In the first case, Link-HD recovers the structure of melanoma, CNS and Renal tissues. For the three clusters structure, the first axis divides the tissues into two groups with leukemia, colon and two breast cancer tissues (yellow in the plot below) and the rest of carcinoma lines (pink). Finally, the second axis splits these carcinoma and leukemia cell-lines from melanoma. The structure of the first axis (which explains the greatest variation (17,11%),Dim.1) is associated with the separation of epithelial and mesenchymal mechanisms, which in turn it is related with the cancer prognosis (Meng et al. 2013).

Output<-LinkData(data,Distance=rep("euclidean",3),Scale = TRUE,Center=TRUE,nCluster = 3)
mydata=Output@Compromise.Coords
#set colors
cols <- c("#ffcc00","#ff0066","#00cc99")
mydata=Output@Compromise.Coords
mydata$fit.cluster<-as.factor(mydata$fit.cluster)
p<-ggplot(mydata, aes(Dim.1,Dim.2)) +
geom_point(aes(colour = fit.cluster))
p + scale_colour_manual(values = cols) + theme_classic() 

The RV plot shows that agilent data has a different profile in respect with the other 3 microarrays platforms (all the RV correlations between agilent and the other 3 datasets are less than 0.8).

CorrelationPlot(Output)

We also apply Select_Var function in order to select the most important variables, responsible of the obtained structure.

Select_Var<-VarSelection(Output,Data=data,Crit = "Rsquare",perc=0.9)
Select_Var@Variables
##   [1] "LHFP"       "WWC2"       "LEPREL1"   
##   [4] "PVR"        "VAMP3"      "LOC344978" 
##   [7] "BACE1"      "FERMT1"     "FKBP10"    
##  [10] "TBC1D2"     "SYNM"       "KRT8P23"   
##  [13] "HERC1"      "SC65"       "ANXA3"     
##  [16] "MLF1"       "PTGES"      "TOR2A"     
##  [19] "OCLN"       "PPIC"       "GPC6"      
##  [22] "GMFG"       "ETV3"       "C9orf30"   
##  [25] "C19orf39"   "S100B"      "PTPN21"    
##  [28] "SMAD3"      "BACE2"      "GNPTAB"    
##  [31] "CA12"       "HSPBAP1"    "ARAP3"     
##  [34] "ZFP36"      "PDLIM2"     "DUSP4"     
##  [37] "FAM57A"     "HELZ"       "ECHDC2"    
##  [40] "SPEG"       "C1orf131"   "ATP1A3"    
##  [43] "C10orf90"   "C3orf59"    "PRKD1"     
##  [46] "F3"         "CA14"       "CPNE7"     
##  [49] "FGD3"       "RAB32"      "KAT2B"     
##  [52] "COPS7A"     "S100A1"     "ERBB2"     
##  [55] "GP9"        "SERPINF1"   "DUSP3"     
##  [58] "THAP10"     "LOC729732"  "CTNS"      
##  [61] "STEAP2"     "EPHA4"      "HDGFRP3"   
##  [64] "FGF1"       "ETS1"       "MEF2C"     
##  [67] "VEPH1"      "TCEAL2"     "STX5"      
##  [70] "VGF"        "ST8SIA1"    "MORN2"     
##  [73] "LOC442157"  "FAM69B"     "GDPD5"     
##  [76] "ANKRD9"     "NAAA"       "EDC4"      
##  [79] "C4orf32"    "STX1A"      "LOC441862" 
##  [82] "LOC728855"  "2.HOXB2"    "GCH1"      
##  [85] "VAV1"       "GPNMB"      "2.C6orf218"
##  [88] "S100B"      "S100A1"     "GPR137B"   
##  [91] "2.CSPG4"    "EGFR"       "2.DKK3"    
##  [94] "CD3D"       "2.LAMA4"    "GMFG"      
##  [97] "2.SPARC"    "2.KLHL6"    "2.RHOH"    
## [100] "CPN1"       "2.GAPDHS"   "EDNRB"     
## [103] "2.POU3F2"   "2.EGFR"     "LYST"      
## [106] "SERPINF1"   "HOXB2"      "C6orf218"  
## [109] "CSPG4"      "DKK3"       "LAMA4"     
## [112] "SPARC"      "KLHL6"      "RHOH"      
## [115] "GAPDHS"     "POU3F2"     "EGFR"      
## [118] "MICAL2"     "PLP1"       "SOX10"     
## [121] "ACP5"       "S100A1"     "TRPV2"     
## [124] "CITED1"     "FABP7"      "MAGEA3"

We compared The list of selected genes against the cancer related markers reported by (Meng et al. 2013). 37.61% (41/109) of the Link-HD selected genes were also reported in (Meng et al. 2013). Between them, we remark 20 melanoma-related markers (MAGEA3, S100A1, S100B, GPNMB, CSPG4,POU3F2,C10orf90,VGF,SERPINF1,GAPDHS,ACP5,CA14,ST8SIA1,FABP7,C6orf218,GPR137B,EDNRB,LYST,SOX10,TRPV2) and five leukemia markers (GMFG, FGD3, VAV1, ATP1A3, RHOH).

Finally, to gain insight into the overrepresented biological processes, we carry out a classical set enrichment analyses, Gene Ontology (GO) approach by using the Functional Gene Networks (FGNet) Bioconductor package (Aibar et al. 2015).

library(FGNet)
library("org.Hs.eg.db")
List<-union(union(union(toupper(colnames(data[[1]])),toupper(colnames(data[[2]]))),toupper(colnames(data[[3]]))),toupper(colnames(data[[4]])))
##hay 22756 genes distintos

geneLabels<-List
feaResults_topGO <- fea_topGO(Select_Var@Variables,genesUniverse = List ,geneIdType="SYMBOL", organism="Hs") 
feaResults <- feaResults_topGO
incidMat <- fea2incidMat(feaResults)
functionalNetwork(incidMat, plotTitleSub="Selected Genes Relationships",legendText = FALSE)

The Gene Enrichment Analysis returns a list with the top Gene Ontology (GO) terms. Between the top 5 of enriched biological processes, we can highlight: Angiogenesis Regulation, which is essential for tumor development and metastasis, Epithelial cell proliferation and Regulation of epithelial cell proliferation, that are involve in colonorectal neoplasms. External tools for Gene Enrichment Analysis are also available. For instance,the results from IPA (Ingenuity Pathway Analysis) reveals several cancer - related pathways, such as the Regulation of the Epithelial-Mesenchymal Transition (EMT), which is crucial in the metastasis development. Moreover, biological functions related with Carcinomas and digestive cancer (Organismal Injury and Abnormalities Carcinoma,Gastrointestinal Disease, Organismal Injury and Abnormalities Digestive system cancer) were also identified.

Final Remarks

In summary, we have developed Link-HD, an R package to integrate heterogeneous variable datasets. It combines some well-known multivariate techniques to perform data integration with cluster, variable selection and network analysis. We show the utility of Link-HD to integrate heterogeneous datasets. Applied to data from the TARA Ocean expedition, Link-HD reproduces the relevance of members of Proteobacteria phyla and temperature on the functionality and stratification of this ecosystem. Regarding the rumen microbiota, the consensus space inferred by Link-HD shows the association of the structure of the rumen microbiome with \(CH_4\) emission. The method also facilitates the study of the relationships between observations and features for performing variable selection. In this sense, a good performance of Link-HD with other available integrative methodologies, such as mixKernel was demonstrated. Finally, as is illustrated in the last example, our approach can also be used to integrate and explore others kind of ‘omics’ data.

Appendix

List of LinkHD dependencies:

methods,
scales,
ggplot2,
dplyr,
cluster,
graphics, 
ggpubr, 
reshape,
gridExtra,
vegan,
data.table,
rio, 
MultiAssayExperiment

References

Aibar, Sara, Celia Fontanillo, Conrad Droste, and Javier De Las Rivas. 2015. “Functional Gene Networks: R/Bioc Package to Generate and Analyze Gene Networks Derived from Functional Enrichment and Clustering.” Bioinformatics 31 (10): 1686–8. doi:10.1093/bioinformatics/btu864.

Aitchison, John. 1982. “The Statistical Analysis of Compositional Data.” Journal of the Royal Statistical Society: Series B (Methodological) 44 (2). Wiley Online Library: 139–60.

Aitchison, John, Jim W Kay, and others. 2003. “Possible Solution of Some Essential Zero Problems in Compositional Data Analysis.” Universitat de Girona. Departament d’Informàtica i Matemàtica Aplicada.

Argelaguet, Ricard, Britta Velten, Damien Arnol, Sascha Dietrich, Thorsten Zenz, John C Marioni, Florian Buettner, Wolfgang Huber, and Oliver Stegle. 2018. “Multi-Omics Factor Analysis—a Framework for Unsupervised Integration of Multi-Omics Data Sets.” Molecular Systems Biology 14 (6). EMBO Press: e8124.

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B (Methodological) 57 (1). Wiley Online Library: 289–300.

Berry, David, and Stefanie Widder. 2014. “Deciphering Microbial Interactions and Detecting Keystone Species with Co-Occurrence Networks.” Frontiers in Microbiology 5. Frontiers: 219.

Danielsson, Rebecca, Johan Dicksved, Li Sun, Horacio Gonda, Bettina Müller, Anna Schnürer, and Jan Bertilsson. 2017. “Methane Production in Dairy Cows Correlates with Rumen Methanogenic and Bacterial Community Structure.” Frontiers in Microbiology 8. Frontiers: 226.

Escoufier, Y, P Cazes, and others. 1976. “Opérateurs et Analyse Des Tableaux à Plus de Deux Dimensions.” Cahiers Du Bureau Universitaire de Recherche Opérationnelle 25: 61–89.

Faust, Karoline, and Jeroen Raes. 2012. “Microbial Interactions: From Networks to Models.” Nature Reviews Microbiology 10 (8). Nature Publishing Group: 538.

Gabriel, K. R. 1971. “The Biplot Graphic Display of Matrices with Application to Principal Component Analysis.” Biometrika 58 (3): 453. doi:10.2307/2334381.

Gower, John C, and David J Hand. 1995. Biplots. Vol. 54. CRC Press.

Greenacre, Michael J. 2010. Biplots in Practice. Fundacion BBVA.

Kittelmann, Sandra, Henning Seedorf, William A Walters, Jose C Clemente, Rob Knight, Jeffrey I Gordon, and Peter H Janssen. 2013. “Simultaneous Amplicon Sequencing to Explore Co-Occurrence Patterns of Bacterial, Archaeal and Eukaryotic Microorganisms in Rumen Microbial Communities.” PloS One 8 (2). Public Library of Science: e47879.

Kruskal, William H, and W Allen Wallis. 1952. “Use of Ranks in One-Criterion Variance Analysis.” Journal of the American Statistical Association 47 (260). Taylor & Francis Group: 583–621.

Kurtz, Zachary D, Christian L Müller, Emily R Miraldi, Dan R Littman, Martin J Blaser, and Richard A Bonneau. 2015. “Sparse and Compositionally Robust Inference of Microbial Ecological Networks.” PLoS Computational Biology 11 (5). Public Library of Science: e1004226.

Lavit, Christine. 1988. “Presentation de La Méthode Statis Permettant L’analyse Conjointe de Plusieurs Tableaux de Données Quantitatives.” Cachiers de La Recherche Dévelopment 18: 49–60.

Lavit, Christine, Yves Escoufier, Robert Sabatier, and Pierre Traissac. 1994. “The ACT (STATIS method).” Computational Statistics & Data Analysis 18 (1): 97–119. doi:10.1016/0167-9473(94)90134-1.

Lima-Mendez, Gipsi, Karoline Faust, Nicolas Henry, Johan Decelle, Sébastien Colin, Fabrizio Carcillo, Samuel Chaffron, et al. 2015. “Determinants of Community Structure in the Global Plankton Interactome.” Science 348 (6237). American Association for the Advancement of Science: 1262073.

Mariette, Jérôme, and Nathalie Villa-Vialaneix. 2017. “Unsupervised Multiple Kernel Learning for Heterogeneous Data Integration.” Bioinformatics 34 (6). Oxford University Press: 1009–15.

Meng, Chen, Bernhard Kuster, Aedin Culhane, and Amin Moghaddas Gholami. 2013. “A Multivariate Approach to the Integration of Multi-Omics Datasets.” BMC Bioinformatics.

Park, Hae-Sang, and Chi-Hyuck Jun. 2009. “A Simple and Fast Algorithm for K-Medoids Clustering.” Expert Systems with Applications 36 (2). Elsevier: 3336–41.

Paulson, Joseph N, O Colin Stine, Héctor Corrada Bravo, and Mihai Pop. 2013. “Differential Abundance Analysis for Microbial Marker-Gene Surveys.” Nature Methods 10 (12). Nature Publishing Group: 1200.

Pawlowsky-Glahn, Vera, and Antonella Buccianti. 2011. Compositional Data Analysis. Wiley Online Library.

Plantes, Henri L’Hermier des. 1976. “Structuration Des Tableaux à Trois Indices de La Statistique: Théorie et Application d’une Méthode d’analyse Conjointe.” PhD thesis, Université des sciences et techniques du Languedoc.

Ramayo-Caldas, Yuliaxis, Nuria Mach, Patricia Lepage, Florence Levenez, Catherine Denis, Gaetan Lemonnier, Jean-Jacques Leplat, et al. 2016. “Phylogenetic Network Analysis Applied to Pig Gut Microbiota Identifies an Ecosystem Structure Linked with Growth Traits.” The ISME Journal 10 (12). Nature Publishing Group: 2973.

Ramayo-Caldas, Yuliaxis, Laura Zingaretti, Milka Popova, Jordi Estellé, Aurelien Bernard, Nicolas Pons, Pau Bellot, et al. 2019. “Identification of Rumen Microbial Biomarkers Linked to Methane Emission in Holstein Dairy Cows.” Journal of Animal Breeding and Genetics. Wiley Online Library.

Reinhold, William C, Margot Sunshine, Hongfang Liu, Sudhir Varma, Kurt W Kohn, Joel Morris, James Doroshow, and Yves Pommier. 2012. “CellMiner: a web-based suite of genomic and pharmacologic tools to explore transcript and drug patterns in the NCI-60 cell line set.” Cancer Research 72 (14): 3499–3511. doi:10.1158/0008-5472.CAN-12-1370.

Robert, Paul, and Yves Escoufier. 1976. “A unifying tool for linear multivariate statistical methods: the RV-coefficient.” Applied Statistics. JSTOR, 257–65.

Rohart, Florian, Benoit Gautier, Amrit Singh, and Kim-Anh Lê Cao. 2017. “MixOmics: An R Package for ‘Omics Feature Selection and Multiple Data Integration.” PLoS Computational Biology 13 (11). Public Library of Science: e1005752.

Rousseeuw, Peter J. 1987. “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathematics 20. Elsevier: 53–65.

Shankavaram, Uma T, Sudhir Varma, David Kane, Margot Sunshine, Krishna K Chary, William C Reinhold, Yves Pommier, and John N Weinstein. 2009. “CellMiner: a relational database and query tool for the NCI-60 cancer cell lines.” BMC Genomics 10 (January): 277. doi:10.1186/1471-2164-10-277.